Фильтрация ошибок функция query синтаксис python - Ремонт и установка крупной бытовой техники

This tutorial will show you how to use the Pandas query method to subset your data.

The tutorial will explain the syntax and also show you step-by-step examples of how to use the Pandas query method.

If you need something specific (like help with syntax, examples, etc), you can click on one of the following links and it will take you to the appropriate section.

Contents:

A Quick Review of Pandas
Introduction to Pandas Query
The syntax of Pandas Query
Pandas Query Examples
Pandas Query FAQ

But if you’re new to Pandas, or new to data manipulation in Python, I recommend that you read the whole tutorial. Everything will make more sense that way.

Ok …. let’s get to it.

A Quick Review of Pandas

Very quickly, let’s review what Pandas is.

Pandas is a package for the Python programming language.

Specifically, Pandas is a toolkit for performing data manipulation in Python. It is a critical toolkit for doing data science in Python.

Pandas works with DataFrames

To get a little more specific, Pandas is a toolkit for creating and working with a data structure called a DataFrame.

A DataFrame is a structure that we use to store data.

DataFrames have a row-and-column structure, like this:

A simple example of a pandas dataframe.

If you’ve worked with Microsoft Excel, you should be familiar with this structure. A Pandas DataFrame is very similar to an Excel spreadsheet, in that a DataFrame has rows, columns, and cells.

There are several ways to create a DataFrame, including importing data from an external file (like a CSV file); and creating DataFrames manually from raw data using the pandas.DataFrame() function.

For more information about DataFrames, check out our tutorial on Pandas DataFrames.

Pandas methods perform operations on DataFrames

Once you have your data inside of a dataframe, you’ll very commonly need to perform data manipulation.

If your data are a little “dirty,” you might need to use some tools to clean the data up: modifying missing values, changing string names, renaming variables, adding variables, etc.

Moreover, once your data are in the DataFrame structure and the data are “clean,” you’ll still need to use some “data manipulation” techniques to analyze your data. Here, I’m talking about things like subsetting, grouping, and aggregating.

Pandas has tools for performing all of these tasks. It is a comprehensive toolkit for working with data and performing data manipulation on DataFrames.

A quick introduction to Pandas query

Among the many tools for performing data manipulation on DataFrames is the Pandas query method.

What is .query() and what does it do?

Query is a tool for querying dataframes and retrieving subsets

At a very high level, the Pandas query method is a tool for generating subsets from a Pandas DataFrame.

For better or worse, there are actually several ways to generate subsets with Pandas.

The loc and iloc methods enable you to retrieve subsets based on row and column labels or by integer index of the rows and columns.

And Pandas has a bracket notation that enables you to use logical conditions to retrieve specific rows of data.

But both of those tools can be a little cumbersome syntactically. Moreover, they are hard to use in conjunction with other data manipulation methods in a smooth, organic way.

In many ways, the Pandas .query method solves those problems.

Query enables you to “query” a DataFrame and retrieve subsets based on logical conditions.

Moreover, the syntax is a little more streamlined than Pandas bracket notation.

Additionally, the Pandas query method can be used with other Pandas methods in a streamlined way that makes data manipulation smooth and straightforward. I’ll show you a little example of that later in the tutorial.

But before we get there, let’s first take a look at the syntax of Pandas query.

The syntax of Pandas query is mostly straightforward.

In order to use this method though, you’ll need to have a Pandas DataFrame.

What that means is that you’ll need to import Pandas and use Pandas to create a DataFrame.

For more information about how to create DataFrames, you can read our introductory tutorial on Pandas DataFrames.

Pandas query syntax

Ok.

Assuming you have a DataFrame, you need to call .query() using “dot syntax”.

Basically, type the name of the DataFrame you want to subset, then type a “dot”, and then type the name of the method …. query().

Like this:

This image shows an explanation of the syntax of the Pandas query method.

In the above syntax explanation, I’m assuming that you have a DataFrame named yourDataFrame.

Then, inside of the query method, there are a few parameters and arguments to the function.

Let’s walk through those.

The parameters of pandas query

Inside of the function, there are really only two arguments or parameters that you need to know about:

expression
inplace

Let’s talk about each of them individually.

`expression` (required)

Here, the expression is some sort of logical expression that describes which rows to return in the output.

If the expression is true for a particular row, the row will be included in the output. If the expression is false for a particular row, that row will be excluded from the output. I’ll show you several examples of these expressions in the examples section of this tutorial.

One note: the expression itself must be presented as a Python string. That means that the expression must be enclosed inside of quotations … either double quotations or single quotations.

Keep in mind, that you may also need to use strings inside of the expression itself. For example, if you need to reference a category called “Dog” inside of your logical expression, and you enclose that category inside of double quotation marks, you’ll need to enclose the overall expression inside of single quotes. If you’re not familiar with this, please review how strings work in Python, and review how to reference strings within strings.

`inplace`

The inplace parameter enables you to specify if you want to directly modify the DataFrame you’re working with.

Remember from the syntax section, when we use the .query() method, we need to type the name of the DataFrame that we want to subset. That DataFrame will serve as the input of the query method.

However, by default, the query method will produce a new DataFrame as the output, and will leave the original DataFrame unchanged.

That’s because by default, the inplace parameter is set to inplace = False. What that means is that query will not modify the original DataFrame “in place”. Instead it creates a new DataFrame. That is literally what the inplace parameter means.

But we can change that behavior by setting inplace = True.

If we override the default and set inplace = True, query will modify the original DataFrame “in place”.

But be careful … if you do this you will overwrite your original DataFrame. Make sure that your code is working the way you want it to!

Examples: how to use .query() to subset a Pandas dataframe

Ok, now that you’ve learned how the syntax works, let’s take a look at some examples.

Examples:

Subset a pandas dataframe based on a numeric variable
Select rows based on a categorical variable
Subset a DataFrame by index
Subset a pandas dataframe by comparing two columns
Select rows based on multiple conditions
Reference local variables inside of query
Modify a DataFrame in Place

Run this code first

Before we actually work with the examples, we need to run some preliminary code.

We’re going to import Pandas and create a dataframe.

Import Pandas

First, let’s just import Pandas.

This is fairly straightforward, but if you’re a beginner, you might not be familiar with it.

To call any function from Pandas (or any other package), we need to first import the package. We import a package with the import statement.

Moreover, we can import a package with the original name (i.e., import pandas). But we also have the option of importing a package with a “nickname.” That’s a very common way of doing things, and that’s what we’re going to do here.

We’re going to import Pandas with the nickname “pd“.

import pandas as pd

Create Data Frame

We also need a dataframe to work with.

Here, we’ll use a create a DataFrame with some dummy sales data.

To do this, we’re just using the pd.DataFrame function from Pandas.

sales_data = pd.DataFrame({"name":["William","Emma","Sofia","Markus","Edward","Thomas","Ethan","Olivia","Arun","Anika","Paulo"]
,"region":["East","North","East","South","West","West","South","West","West","East","South"]
,"sales":[50000,52000,90000,34000,42000,72000,49000,55000,67000,65000,67000]
,"expenses":[42000,43000,50000,44000,38000,39000,42000,60000,39000,44000,45000]})

And if we print it out, we can see the dataset.

       name region  sales  expenses
0   William   East  50000     42000
1      Emma  North  52000     43000
2     Sofia   East  90000     50000
3    Markus  South  34000     44000
4    Edward   West  42000     38000
5    Thomas   West  72000     39000
6     Ethan  South  49000     42000
7    Olivia   West  55000     60000
8      Arun   West  67000     39000
9     Anika   East  65000     44000
10    Paulo  South  67000     45000

Notice that the DataFrame has four variables: name, region, sales, and expenses.

We’re going to use these variables (and the row index) to subset our data with .query().

EXAMPLE 1: Subset a pandas dataframe based on a numeric variable

Let’s start with a very simple example.

Here, we’re going to subset the data on a numeric variable: sales.

We’re going to retrieve the rows in the DataFrame where sales is greater than 60000.

To do this, we’re simply providing a logical expression inside of the parenthesis: 'sales > 60000'.

Let’s run the code so you can see the output.

sales_data.query('sales > 60000')

OUT:

      name region  sales  expenses
2    Sofia   East  90000     50000
5   Thomas   West  72000     39000
8     Arun   West  67000     39000
9    Anika   East  65000     44000
10   Paulo  South  67000     45000

Notice that when we run the code, it returns 5 rows. The original DataFrame has 11 rows, but the output has 5.

Moreover, all of the rows of data in the output meet the criteria defined in the logical expression 'sales > 60000'. All of the records have sales greater than 60000.

This is really straightforward. The query method used the expression 'sales > 60000' as a sort of filtering criteria. If the criteria is true for a particular row, that row is included in the output. If the criteria is false for a particular row, it is excluded.

Now notice that in this simple example, we used the greater-than sign to filter on the sales variable. This is one thing to do, but we could also test for equivalence (==), test for greater-than-or-equal, lest-than, etc. Almost any comparison operator will work.

And as you’ll see in upcoming examples, we can combine expressions using logical operators to make more complex expressions.

EXAMPLE 2: Select rows based on a categorical variable

Next, we’re going to select a group of rows by filtering on a categorical variable.

We’re going to retrieve all of the rows where region is equal to East.

To do this, we’re going to call the .query() method using “dot notation.” We simply type the name of our DataFrame, sales_data, and then type the name of the method, .query().

Inside of the parenthesis of the query method, we have a logical expression: 'region == "East"'.

Let’s run the code and take a look at the output.

Here is the code:

sales_data.query('region == "East"')

And here is the output, which is a new dataframe:

      name region  sales
0  William   East  50000
2    Sofia   East  90000
9    Anika   East  65000

Notice that the output DataFrame contains all of the records for the region East.

A few notes about this.

First, the logical expression is syntactically just a Python logical expression. Here, we’re using the equivalence operator from Python, the double equal sign (==).

Second, the whole logical expression is contained inside of single quotation marks. That is, we provide the logical expression to .query() in the form of a string. Query expects a string.

Third, we’re referencing "East", which is one of the unique values of the region variable. Notice that this value is actually contained inside of double quotation marks. This is because we treat string values of a DataFrame as strings, and as such, it needs to be inside of quotation marks. But since the overall expression is already inside of single quotes, we need to use double quotes for the value "East". If you don’t understand this, you need to review the rules for using quotes inside of quotes with Python.

EXAMPLE 3: Subset a DataFrame by index

Here, we’re going to reference the index of the DataFrame and subset the rows based on that index.

First, let’s just print out the DataFrame.

print(sales_data)

OUT:

       name region  sales  expenses
0   William   East  50000     42000
1      Emma  North  52000     43000
2     Sofia   East  90000     50000
3    Markus  South  34000     44000
4    Edward   West  42000     38000
5    Thomas   West  72000     39000
6     Ethan  South  49000     42000
7    Olivia   West  55000     60000
8      Arun   West  67000     39000
9     Anika   East  65000     44000
10    Paulo  South  67000     45000

Notice that when we print out the DataFrame, each row has an integer associated with it on the left hand side, starting at zero.

This group of numbers is the index. By default, when we create a DataFrame, each row will be given a numeric index value like this (although, there are ways to change the index to something else).

We can reference these index values inside of query.

To do this, just use the word ‘index’.

Here’s an example. Here, we’re going to return the rows where index is less than 3. This will effectively return the first three rows.

sales_data.query('index < 3')

OUT:

      name region  sales  expenses
0  William   East  50000     42000
1     Emma  North  52000     43000
2    Sofia   East  90000     50000

Notice that in this output, the index values (on the left side of the printout) are all less than 3.

You can also use mathematical operations inside of your expressions, and this can be a useful technique to subset your data in interesting ways.

For example, we can use the modulo operator (%) on index to retrieve the «odd» rows of our DataFrame:

sales_data.query('index%2 == 1')

OUT:

     name region  sales  expenses
1    Emma  North  52000     43000
3  Markus  South  34000     44000
5  Thomas   West  72000     39000
7  Olivia   West  55000     60000
9   Anika   East  65000     44000

Using index inside of your expressions for query is a good way of conditionally subsetting on the index.

EXAMPLE 4: Subset a pandas dataframe by comparing two columns

Now, let’s make things a little more complicated.

Here, we’re going to compare two variables – sales and expenses – and return the rows where sales is less than expenses.

sales_data.query('sales < expenses')

OUT:

     name region  sales  expenses
3  Markus  South  34000     44000
7  Olivia   West  55000     60000

This is pretty straightforward. Our logical expression, 'sales < expenses', instructs the query method to return rows where sales is less than expenses. And that’s what is in the output!

Keep in mind that we can use more than two variable in our expressions, or make the expressions more complicated with logical operators.

I’ll show you an example of that in the next example.

EXAMPLE 5: Subset a pandas dataframe with multiple conditions

Here, we’re going to subset the DataFrame based on a complex logical expression. The expression is composed of two smaller expressions that are being combined with the and operator.

We’re going to return rows where sales is greater than 50000 AND region is either ‘East’ or ‘West’.

Here’s the code:

sales_data.query('(sales > 50000) and (region in ["East", "West"])')

And here’s the output:

     name region  sales  expenses
2   Sofia   East  90000     50000
5  Thomas   West  72000     39000
7  Olivia   West  55000     60000
8    Arun   West  67000     39000
9   Anika   East  65000     44000

Notice which rows are actually in the output. The rows all have a region that’s either ‘East’ or ‘West’. Additionally, all of the rows have sales greater than 50000.

We did this by creating a compound logical expression using the and operator.

The overall expression inside of query is '(sales > 50000) and (region in ["East", "West"])'.

Notice that there are two parts. The first part is (sales > 50000). The second part is (region in ["East", "West"]). Both of these parts are small expressions themselves that will subset our DataFrame. But here, we’re combining them together with the logical ‘and‘ operator. This tells query to return rows where both parts are True.

If you want, you could also use the logical ‘or‘ operator and the logical ‘not‘ operator inside of your expressions. Essentially, you use the Python logical operators to create more complicated expressions and subset your data in more complex ways.

EXAMPLE 6: reference local variables inside of query

Now, let’s do something a little different.

So far, we’ve been referencing variables that are actually inside of the DataFrame. We’ve been referencing the names of the columns, like sales and expenses.

Now, we’re going to reference a variable outside of the dataframe.

Here, we’re going to calculate the mean of the sales variable and store it as a separate variable outside of our DataFrame.

sales_mean = sales_data.sales.mean()

And let’s take a look at the value:

print(sales_mean)

OUT:

58454.545

This variable, sales_mean, simply holds the value of the mean of our sales variable.

Next, we’re going to reference that variable.

To do this, we’re going to use the ‘@‘ character in front of the variable.

Let’s look at the code:

sales_data.query('sales > @sales_mean')

And here is the output:

      name region  sales  expenses
2    Sofia   East  90000     50000
5   Thomas   West  72000     39000
8     Arun   West  67000     39000
9    Anika   East  65000     44000
10   Paulo  South  67000     45000

Notice that for all of the rows, sales is greater than 58454.545 (the mean of sales).

EXAMPLE 7: Modify a DataFrame in Place

Finally, let’s do one more example.

Here, we’re going to modify a DataFrame «in place».

That means that we’re going to directly modify the DataFrame that we’re operating on, instead of producing a new DataFrame.

Now keep in mind: modifying a DataFrame in place can be risky, because we will overwrite our data.

That being the case, I’m actually going to create a duplicate DataFrame first called sales_data_dupe. This way, when we modify the data, we’ll overwrite the duplicate and keep our original intact.

sales_data_dupe = sales_data.copy()

Now, let’s modify sales_data_dupe.

Here, we’re going to select rows where 'index < 5', but we’re going to modify the DataFrame directly by setting inplace = True.

sales_data_dupe.query('index < 5', inplace = True)

And let’s print out the output:

print(sales_data_dupe)

OUT:

      name region  sales  expenses
0  William   East  50000     42000
1     Emma  North  52000     43000
2    Sofia   East  90000     50000
3   Markus  South  34000     44000
4   Edward   West  42000     38000

Notice that the rows in sales_data_dupe meet our criteria that the index must be less than 5. Moreover, notice that query modified the DataFrame directly (instead of producing a new DataFrame).

Frequently asked questions about Pandas query

Here are a few frequently asked questions about the Pandas query method.

Frequently asked questions:

What’s the advantage of using .query() over brackets
How does query in Pandas compare to SQL code

What’s the advantage of using `.query()` over brackets

In Python, there are many ways to select rows from a Pandas dataframe.

By far, the most common is to subset by using «bracket notation». Here’s an example (note that we’re using the DataFrame sales_data created above):

sales_data[sales_data['sales'] > 50000]

This is essentially equivalent to this code using query:

sales_data.query('sales > 50000')

Notice how much cleaner the code is when we use query. It’s easier to read. The code looks more like «English» and less like jumbled computer syntax.

In the «bracket» version, we need to repeatedly type the name of the DataFrame. Just to reference a variable, we need to type the name of the DataFrame (i.e., sales_data['sales']).

In the query version though, we just type the name of the column.

Moreover, when we use the Pandas query method, we can use the method in a «chain» of Pandas methods, similar to how you use pipes in R’s dplyr.

Ultimately, using the query method is easier to write, easier to read, and more powerful because you can use it in conjunction with other methods.

How does query in Pandas compare to SQL code

If you’re coming from an SQL background, you might be trying to figure out how to re-write your SQL queries with Pandas.

If that’s the case, you need to know how to translate SQL syntax into Pandas.

So if you were trying to translate SQL into Pandas, how does .query() fit in?

The query method is like a where statement in SQL.

You can use query to specify conditions that your rows must meet in order to be returned.

Leave your other questions in the comments below

Do you have more questions about the Pandas query method?

Leave your question in the comments section below.

Join our course to learn more about Pandas

If you’re serious about learning Pandas, you should enroll in our premium Pandas course called Pandas Mastery.

Pandas Mastery will teach you everything you need to know about Pandas, including:

How to subset your Python data
Data aggregation with Pandas
How to reshape your data
and more …

Moreover, it will help you completely master the syntax within a few weeks. You’ll discover how to become «fluent» in writing Pandas code to manipulate your data.

Find out more here:

Learn More About Pandas Mastery

Pandas DataFrame.query() method is used to query the rows based on the expression (single or multiple column conditions) provided and returns a new DataFrame. In case you wanted to update the existing referring DataFrame use inplace=True argument.

In this article, I will explain the syntax of the Pandas DataFrame query() method and several working examples like query with multiple conditions and query with string contains to new few.

Related:

pandas.DataFrame.filter() – To filter rows by index and columns by name.
pandas.DataFrame.loc[] – To select rows by indices label and column by name.
pandas.DataFrame.iloc[] – To select rows by index and column by position.
pandas.DataFrame.apply() – To custom select using lambda function.

1. Quick Examples of pandas query()

If you are in hurry, below are quick examples of how to use pandas.DataFrame.query() method.


# Query Rows using DataFrame.query()
df2=df.query("Courses == 'Spark'")

#Using variable
value='Spark'
df2=df.query("Courses == @value")

#inpace
df.query("Courses == 'Spark'",inplace=True)

#Not equals, in & multiple conditions
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("`Courses Fee` >= 23000")
df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")

If you are a learner, Let’s see with sample data and run through these examples and explore the output to understand better. First, let’s create a pandas DataFrame from Dict.


import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
df = pd.DataFrame(technologies)
print(df)

Note that the above DataFrame also contains None and Nan values on Duration column that I would be using in my examples below to select rows that has None & Nan values or select ignoring these values.

3. Using DataFrame.query()

Following is the syntax of DataFrame.query() method.


# query() method syntax
DataFrame.query(expr, inplace=False, **kwargs)

expr – expression takes conditions to query rows
inplace – Defaults to False. When set toTrue, it updates the referring DataFrame and query() method returns None.
**kwargs – Keyword arguments that works with eval()

DataFrame.query() takes condition in expression to select rows from a DataFrame. This expression can have one or multiple conditions.


# Query all rows with Courses equals 'Spark'
df2=df.query("Courses == 'Spark'")
print(df2)

Yields below output.


  Courses    Fee Duration  Discount
0   Spark  22000   30days      1000

In case you wanted to use a variable in the expression, use @ character.


# Query Rows by using Python variable
value='Spark'
df2=df.query("Courses == @value")
print(df2)

If you notice the above examples return a new DataFrame after filtering the rows. if you wanted to update the existing DataFrame use inplace=True


# Replace current esisting DataFrame
df.query("Courses == 'Spark'",inplace=True)
print(df)

If you wanted to select based on column value not equals then use != operator.


# not equals condition
df2=df.query("Courses != 'Spark'")

Yields below output.


   Courses  Courses Fee Duration  Discount
1  PySpark        25000   50days      2300
2   Hadoop        23000   30days      1000
3   Python        24000     None      1200
4   Pandas        26000      NaN      2500

4. Select Rows Based on List of Column Values

If you have values in a python list and wanted to select the rows based on the list of values, use in operator, it’s like checking a value contains in a list of string values.


# Query Rows by list of values
print(df.query("Courses in ('Spark','PySpark')"))

Yields below output.


   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300

You can also write with a list of values in a python variable.


# Query Rows by list of values
values=['Spark','PySpark']
print(df.query("Courses in @values"))

To select rows that are not in a list of column values can be done using not in operator.


# Query Rows not in list of values
values=['Spark','PySpark']
print(df.query("Courses not in @values"))

If you have column names with special characters using column name surrounded by tick ` character .


# Using columns with special characters
print(df.query("`Courses Fee` >= 23000"))

5. Query with Multiple Conditions

In Pandas or any table-like structures, most of the time we would need to select the rows based on multiple conditions by using multiple columns, you can do that in Pandas DataFrame as below.


# Query by multiple conditions
print(df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000"))

Yields below output. Alternatively, you can also use pandas loc with multiple conditions.


  Courses  Courses Fee Duration  Discount
2  Hadoop        23000   30days      1000
3  Python        24000     None      1200

6. Query Rows using apply()

pandas.DataFrame.apply() method is used to apply the expression row-by-row and return the rows that matched the values. The below example returns every match when Courses contains a list of specified string values.


# By using lambda function
print(df.apply(lambda row: row[df['Courses'].isin(['Spark','PySpark'])]))

Yields below output. A lambda expression is used with pandas to apply the function for each row.


   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300

8. Other Examples using df[] and loc[]


# Other examples you can try to query rows
df[df["Courses"] == 'Spark'] 
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]

# Select based on value contains
print(df[df['Courses'].str.contains("Spark")])

# Select after converting values
print(df[df['Courses'].str.lower().str.contains("spark")])

#Select startswith
print(df[df['Courses'].str.startswith("P")])

Conclusion

In this article, I have explained multiple examples of how to query Pandas DataFrame Rows based on single and multiple conditions, from a list of values (checking column value exists in list of string values) e.t.c. Remember when you query DataFrame Rows, it always returns a new DataFrame with selected rows, in order to update existing df you have to use inplace=True. I hope this article helps you learn Pandas.

Happy Learning !!

Different Ways to Rename Pandas DataFrame Column
How to Drop Column From Pandas DataFrame
Pandas- How to get a Specific Cell Value from DataFrame
Pandas Filter DataFrame by Multiple Conditions
Pandas apply map (applymap()) Explained
Apply Multiple Filters to Pandas DataFrame or Series
Pandas Filter Rows by Conditions
Pandas Filter by Column Value

References

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
https://pandas.pydata.org/docs/reference/api/pandas.eval.html#pandas.eval

Источник

Время прочтения: 3 мин.

Pandas, безусловно, является одним из основных инструментов, необходимых для работы с данными в Python. Он предлагает множество различных способов фильтрации фреймов данных. Один из них я сегодня покажу на примере.

Недавно подошла сотрудница с задачей — в файле нужно найти материалы на складах и сформировать файл для фильтрации по группам, чтобы можно удобно просмотреть в Excel. Количество материалов более 300 тысяч.

Пример:

При использовании библиотеки Левенштейна пришлось бы использовать цикл в цикле, при большом количестве данных проверка могла бы затянутся на долго. Анализ данных требует много операций фильтрации. Pandas предоставляют множество методов для фильтрации фрейма данных, и Dataframe.query() является одним из них.

import pandas as pd
pd.options.display.max_columns = None
      
mPath1=r"C:UsersnikiDesktopRACK"            
PATH =  mPath1+"\Материалы.XLSX"
f1 = open((mPath1+ '\1205_01_sort.csv'),mode="a", encoding="utf-8")
df = pd.read_excel(PATH,'Sheet1')

При загрузке файла в колонке Материала, все привел к верхнему регистру, удалил лишние пробелы и знаки

df['Lt'] = df['Material].str.replace('W+',' ').str.replace('_',' ').str.strip().str.replace('  ',' ')
df['IND1']=0  
df['IND2']=0  
slist=df.iloc[0,4].split(' ')
kl=0
Col_text=2     # По двум текстам совпадения
Dl_text=2       # Кол-во символов в тексте
df['Lt'] = df['Lt'].fillna(' ')
while len(df)!=0:
    kl=kl+1
    kol=0
    ql=''
    df=df.loc[df['IND1']==0]
   if len(df.iloc[0,4])>=3:
        slist=df.iloc[0,4].split(' ')
        if len(slist)>1:
            for i in range(len(slist)):
                if len(slist[i])> Dl_text:
                    kol=kol+1
                    if kol<= Col_text:    # По двум текстам совпадения. При условии текст больше 2 символов
                        ql=ql+'and Lt.str.contains("'+slist[i]+'") '
        else:
            ql=ql+'and Lt.str.contains("'+slist[0]+'") ' 
# Формировалась строка ql для фильтрации датафрейма 
        
        if len(ql)>4:
            ql=ql[4:]
            ind_m = df.query(ql, engine='python').index
            df.at[ind_m, 'IND1'] = 1
            df.at[ind_m, 'IND2'] = kl
        for j in range(len(ind_m)): 
            sk1=df.loc[ind_m[j]][0]
            sk2=df.loc[ind_m[j]][2]
            sk3=df.loc[ind_m[j]][5]
            sk4=df.loc[ind_m[j]][6]

IND1 по умолчанию равен 0, после, если товар попадает под фильтр, получает значение 1.

IND2 (после фильтрации IND1) получает значение kol – это индекс для фильтрации в Excel, который используется в полученном файле

            sks='"'+str(sk1)+'";"'+str(sk2)+'";"'+str(sk3)+'";"'+str(sk4)+'"'
            f1.write(sks + 'n')
        print(kl,len(df))  
    else:
        df.iloc[0,5]=1
f1.close()

Результат:

Теперь загрузив файл в Excel, можно сделать фильтрацию по колонке IND2. Сохранение в файл сделал по запросу сотрудника, чтобы видеть результат по заполнения файла в работе программы, файл можно просматривать в реальном времени либо с FAR, Notepad++. Программы доступны в SberUserSoft.

Преимущество этого способа является подходящим для анализа данных. В этом примере я использовал функцию запроса по элементам списка.

В заключение хотелось бы сказать, что Pandas, при работе с большими объемами данных, является большой альтернативой Excel.

Источник

This tutorial will show you how to use the Pandas query method to subset your data.

The tutorial will explain the syntax and also show you step-by-step examples of how to use the Pandas query method.

If you need something specific (like help with syntax, examples, etc), you can click on one of the following links and it will take you to the appropriate section.

Contents:

A Quick Review of Pandas
Introduction to Pandas Query
The syntax of Pandas Query
Pandas Query Examples
Pandas Query FAQ