Pandas Read CSV Tutorial
In this tutorial we will learn how to work with comma separated (CSV) files in Python and Pandas. We will get an overview of how to use Pandas to load CSV to dataframes and how to write dataframes to CSV.
In the first section, we will go through, with examples, how to read a CSV file, how to read specific columns from a CSV, how to read multiple CSV files and combine them to one dataframe, and, finally, how to convert data according to specific datatypes (e.g., using Pandas read_csv dtypes). In the last section we will continue by learning how to write CSV files. That is, we will learn how to export dataframes to CSV files.
In the first example of this Pandas read CSV tutorial we will just use read_csv to load CSV to dataframe that is in the same directory as the script. If we have the file in another directory we have to remember to add the full path to the file. Here’s the first, very simple, Pandas read_csv example:
df = pd.read_csv('amis.csv')
df.head()
The data can be downloaded here but in the following examples we are going to use Pandas read_csv to load data from a URL.
In the next read_csv example we are going to read the same data from a URL. It’s very simple we just put the URL in as the first parameter in the read_csv method:
url_csv = 'https://vincentarelbundock.github.io/Rdatasets/csv/boot/amis.csv'
df = pd.read_csv(url_csv)
As can be seen in the output above, we get a column named 'Unnamed: 0'. We can also see that it contains numbers. Thus, we can use this column as the index column. In the next code example we are going to use Pandas read_csv and the index_col parameter. This parameter can take an integer or a sequence. In our case we are going to use the integer 0, and we will get a much nicer dataframe:
df = pd.read_csv(url_csv, index_col=0)
df.head()
The index_col parameter also can take a string as input and we will now use a different datafile. In the next example we will read a CSV into a Pandas dataframe and use the idNum column as index.
csv_url = 'http://vincentarelbundock.github.io/Rdatasets/csv/carData/MplsStops.csv'
df = pd.read_csv(csv_url, index_col='idNum')
df.iloc[:, 0:6].head()
Note, to get the above output we used Pandas iloc to select the first six columns. This was done to get an output that could be more easily illustrated. That said, we are now continuing to the next section, where we are going to read certain columns from a CSV file into a dataframe.
In some cases we don’t want to parse every column in the csv file. To only read certain columns we can use the parameter usecols. Note, if we want the first column to be index column and we want to parse the three first columns we need to have a list with 4 elements (compare my read_excel usecols example here):
cols = [0, 1, 2, 3]
df = pd.read_csv(url_csv, index_col=0, usecols=cols)
df.head()
Of course, using read_csv usecols makes more sense if we have a CSV file with more columns. We can use Pandas read_csv usecols with a list of strings as well. In the next example we return to the larger file we used previously:
csv_url = 'http://vincentarelbundock.github.io/Rdatasets/csv/carData/MplsStops.csv'
df = pd.read_csv(csv_url, index_col='idNum',
                 usecols=['idNum', 'date', 'problem', 'MDC'])
df.head()
In some of the previous read_csv examples we got an unnamed column. We solved this by setting that column as the index, or by using usecols to select specific columns from the CSV file. However, for some reason we may not want to do either of those. Here's one example of how to use Pandas read_csv to get rid of the column 'Unnamed: 0':
csv_url = 'http://vincentarelbundock.github.io/Rdatasets/csv/carData/MplsStops.csv'
cols = pd.read_csv(csv_url, nrows=1).columns
df = pd.read_csv(csv_url, usecols=cols[1:])
df.iloc[:, 0:6].head()
It is, of course, also possible to remove the unnamed columns after we have loaded the CSV into a dataframe. To remove the unnamed columns we can use two different methods, loc and drop, together with other Pandas dataframe methods. When using the drop method we can use the inplace parameter and get a dataframe without unnamed columns.
df.drop(df.columns[df.columns.str.contains('unnamed', case=False)],
        axis=1, inplace=True)
# The following line will give us the same result as the line above
# df = df.loc[:, ~df.columns.str.contains('unnamed', case=False)]
df.iloc[:, 0:7].head()
To explain the code example above: we select the columns that do not contain the string 'unnamed'. Furthermore, we used the case parameter so that the contains method is not case sensitive; thus, we would catch columns named both 'Unnamed' and 'unnamed'. In the first line, using Pandas drop, we also used the inplace parameter so that the dataframe is changed in place. Finally, the axis parameter is used to drop columns instead of the index (i.e., rows).
If we have missing data in our CSV file and it's coded in a way that makes it impossible for Pandas to find it, we can use the parameter na_values. In the example below, the amis.csv file has been changed and some cells contain the string "Not Available".
That is, we are going to change "Not Available" to something that we can easily remove when carrying out data analysis later.
df = pd.read_csv('Simdata/MissingData.csv', index_col=0,
                 na_values="Not Available")
df.head()
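Since the MissingData.csv file above isn't generally available, here is a minimal, self-contained sketch of the same idea using a made-up in-memory CSV (the column names are for illustration only):

```python
import pandas as pd
from io import StringIO

# In-memory stand-in for a CSV file with "Not Available" in some cells
csv_data = StringIO("speed,period\n26,1\nNot Available,2\n28,Not Available\n")

df = pd.read_csv(csv_data, na_values="Not Available")

# Both "Not Available" cells are now NaN
print(df.isna().sum().sum())
```

Every cell that exactly matches the na_values string is parsed as NaN, which later methods such as dropna or fillna can handle.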
What if our data file(s) contain extra information in the first x rows? For instance, how can we skip the first three rows in a file looking like this:
We will now learn how to use Pandas read_csv and skip a number of rows. Luckily, it's very simple: we just use the skiprows parameter. In the following example we use read_csv with skiprows=3 to skip the first 3 rows.
df = pd.read_csv('Simdata/skiprow.csv', index_col=0, skiprows=3)
df.head()
Note, we can obtain the same result as above using the header parameter (i.e., data = pd.read_csv('Simdata/skiprow.csv', header=3)).
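A small self-contained sketch of this equivalence, using made-up in-memory data in place of skiprow.csv:

```python
import pandas as pd
from io import StringIO

# Three junk rows followed by the real header and data (made-up contents)
raw = "junk,junk\njunk,junk\njunk,junk\nspeed,period\n26,1\n30,2\n"

df_skip = pd.read_csv(StringIO(raw), skiprows=3)
df_head = pd.read_csv(StringIO(raw), header=3)

# Both approaches yield the same dataframe
print(df_skip.equals(df_head))
```

With header=3, row index 3 is used as the column names and everything before it is discarded, which is the same effect as skiprows=3 with the default header.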
If we don't want to read every row in the CSV file, we can use the parameter nrows. In the next example below, we read the first 8 rows of a CSV file.
df = pd.read_csv(url_csv, nrows=8)
df
If we want to select random rows, we can load the complete CSV file and use Pandas sample to randomly select rows.
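A minimal sketch of that approach, with a small made-up in-memory CSV so the example is self-contained:

```python
import pandas as pd
from io import StringIO

# Small in-memory CSV standing in for a file on disk (made-up data)
csv_data = StringIO("speed,period\n26,1\n30,1\n28,2\n31,2\n27,3\n")
df = pd.read_csv(csv_data)

# Randomly pick 3 rows; random_state makes the draw reproducible
sample_df = df.sample(n=3, random_state=1)
print(len(sample_df))
```

The frac parameter of sample can be used instead of n to draw a fraction of the rows (e.g., frac=0.5 for half of them).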
We can also set the data types for the columns. Although all columns in the amis dataset contain integers, we can set some of them to the string data type. This is exactly what we will do in the next Pandas read_csv example. We will use the Pandas read_csv dtype parameter and pass in a dictionary:
url_csv = 'https://vincentarelbundock.github.io/Rdatasets/csv/boot/amis.csv'
df = pd.read_csv(url_csv, dtype={'speed': int, 'period': str,
                                 'warning': str, 'pair': int})
df.info()
It is, of course, possible to force other data types such as integer and float. All we have to do is change str to float, for instance (given that we have decimal numbers in that column, of course).
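For instance, a quick self-contained sketch using a made-up in-memory CSV:

```python
import pandas as pd
from io import StringIO

csv_data = StringIO("speed,pair\n26,1\n30,2\n28,3\n")

# Force the speed column to float even though the file only has integers
df = pd.read_csv(csv_data, dtype={'speed': float})
print(df['speed'].dtype)
```

Columns not listed in the dtype dictionary keep the type Pandas infers on its own.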
If we have data from many sources, such as experiment participants, we may have it in multiple CSV files. If the data from the different CSV files are going to be analyzed together, we may want to load them all into one dataframe. In the next examples we are going to use Pandas read_csv to read multiple files.
First, we are going to use Python's os and fnmatch modules to list all CSV files with the word "Day" in the directory "SimData". Next, we use a Python list comprehension to load the CSV files into dataframes (stored in a list; see the type(dfs) output).
import os, fnmatch

csv_files = fnmatch.filter(os.listdir('./SimData'), '*Day*.csv')
dfs = [pd.read_csv(os.path.join('SimData', csv_file)) for csv_file in csv_files]
type(dfs)
# Output: list
Finally, we use the method concat to concatenate the dataframes in our list. In the example files there is a column called ‘Day’ so that each day (i.e., CSV file) is unique.
df = pd.concat(dfs, sort=False)
df.Day.unique()
The second method we are going to use is a bit simpler: using Python's glob module. If we compare the two methods (os + fnmatch vs. glob), we can see that in the list comprehension we don't have to put in the path. This is because glob gives us the full path to our files. Handy!
import glob

csv_files = glob.glob('SimData/*Day*.csv')
dfs = [pd.read_csv(csv_file) for csv_file in csv_files]
df = pd.concat(dfs, sort=False)
If we don't have a column in each CSV file identifying which dataset it is (e.g., data from different days), we could put the filename in a new column of each dataframe:
import glob
import os

csv_files = glob.glob('SimData/*Day*.csv')
dfs = []
for csv_file in csv_files:
    temp_df = pd.read_csv(csv_file)
    # Store the filename (without the directory) in a new column
    temp_df['DataF'] = os.path.basename(csv_file)
    dfs.append(temp_df)
In this section we will learn how to export dataframes to CSV files. We will start by creating a dataframe with some variables, but first we start by importing the Pandas module:
import pandas as pd
The next step is to create a dataframe. We will create the dataframe using a dictionary. The keys will be the column names and the values will be lists containing our data:
df = pd.DataFrame({'Names': ['Andreas', 'George', 'Steve',
                             'Sarah', 'Joanna', 'Hanna'],
                   'Age': [21, 22, 20, 19, 18, 23]})
df.head()
We then write the dataframe to a CSV file using the Pandas to_csv method. In the example below we don't use any parameters except path_or_buf (in this case, the filename).
df.to_csv('NamesAndAges.csv')
Here's what the exported dataframe looks like:
As can be seen in the output above, we get a new column when we are not using any parameters. This column is the index column from our Pandas dataframe. We can use the parameter index and set it to False to get rid of this column.
df.to_csv('NamesAndAges.csv', index=False)
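A self-contained sketch of the same round trip, writing to an in-memory string instead of a file so the effect of index=False is easy to verify:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'Names': ['Andreas', 'George'], 'Age': [21, 22]})

# Export without the index column, then read the text back to check
csv_text = df.to_csv(index=False)
df_back = pd.read_csv(StringIO(csv_text))
print(list(df_back.columns))
```

Without index=False, the round trip would instead produce an extra "Unnamed: 0" column holding the old index.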
If we have many dataframes and we want to export them all to the same CSV file, that is, of course, possible. In the Pandas to_csv example below we have 3 dataframes. We are going to use Pandas concat with the parameters keys and names.
This is done to create two new columns, named Group and Row Num. The important part is Group which will identify the different dataframes. In the last row of the code example we use Pandas to_csv to write the dataframes to CSV.
df1 = pd.DataFrame({'Names': ['Andreas', 'George', 'Steve',
                              'Sarah', 'Joanna', 'Hanna'],
                    'Age': [21, 22, 20, 19, 18, 23]})
df2 = pd.DataFrame({'Names': ['Pete', 'Jordan', 'Gustaf',
                              'Sophie', 'Sally', 'Simone'],
                    'Age': [22, 21, 19, 19, 29, 21]})
df3 = pd.DataFrame({'Names': ['Ulrich', 'Donald', 'Jon',
                              'Jessica', 'Elisabeth', 'Diana'],
                    'Age': [21, 21, 20, 19, 19, 22]})

df = pd.concat([df1, df2, df3],
               keys=['Group1', 'Group2', 'Group3'],
               names=['Group', 'Row Num']).reset_index()
df.to_csv('MultipleDfs.csv', index=False)
In the CSV file we get 4 columns. The keys parameter with the list (['Group1', 'Group2', 'Group3']) enables identification of the different dataframes we wrote. We also get the column "Row Num", which contains the row numbers for each dataframe:
In this tutorial we have learned about importing CSV files into Pandas dataframes. More specifically, we have learned how to load CSV files into dataframes, read specific columns, skip rows, handle missing data, read multiple CSV files into one dataframe, set column data types, and export dataframes to CSV files.