Most of the sequence operations we have to do in our pandas information cleansing initiatives could be assisted by AI instruments, together with by PandasAI. PandasAI takes benefit of enormous language fashions, comparable to that from OpenAI, to allow pure language queries and operations on information columns. On this put up, we look at how you can use PandasAI to question Sequence values, create new Sequence, set Sequence values conditionally, and reshape our information.
You may set up PandasAI by getting into pip set up pandasai right into a terminal or into Home windows Powershell. Additionally, you will must get a token from openai.com to ship a request to the OpenAI API.
Because the PandasAI library is creating quickly, you may anticipate totally different outcomes relying on the variations of PandasAI and pandas you might be utilizing. On this article, I exploit model 1.4.8 of PandasAI and model 1.5.3 of pandas.
We are going to work with information from the Nationwide Longitudinal Examine of Youth (NLS) performed by the US Bureau of Labor Statistics. The NLS has surveyed the identical cohort of highschool college students for over 25 years, and has helpful information gadgets on instructional outcomes and weeks labored for every of these years, amongst many different variables. It’s out there for public use at nlsinfo.org. (The NLS public releases are lined by the US authorities Open Information Coverage, which allows each non-commercial and industrial use.)
We may also work with COVID-19 information supplied by Our World in Information. That dataset has one row per nation per day with variety of new circumstances and new deaths. This dataset is obtainable for obtain at ourworldindata.org/covid-cases, with a Inventive Commons CC BY 4.0 license. You too can obtain all code and information used on this put up from GitHub.
We begin by importing the OpenAI and SmartDataframe modules from PandasAI. We additionally must instantiate an llm object:
import pandas as pd
from pandasai.llm.openai import OpenAI
from pandasai import SmartDataframe
llm = OpenAI(api_token="Your API Token")
Subsequent, we load the DataFrames we will likely be utilizing and create a SmartDataframe object from the NLS pandas DataFrame:
covidcases = pd.read_csv("information/covidcases.csv")
nls97 = pd.read_csv("information/nls97f.csv")
nls97.set_index("personid", inplace=True)
nls97sdf = SmartDataframe(nls97, config={"llm": llm})
Now we’re able to generate abstract statistics on Sequence from our SmartDataframe. We are able to ask for the typical for a single Sequence, or for a number of Sequence:
nls97sdf.chat("Present common of gpaoverall")2.8184077281812128
nls97sdf.chat("Present common for every weeks labored column")
Common Weeks Labored
0
weeksworked00 26.42
weeksworked01 29.78
weeksworked02 31.83
weeksworked03 33.51
weeksworked04 35.10
weeksworked05 37.34
weeksworked06 38.44
weeksworked07 39.29
weeksworked08 39.33
weeksworked09 37.51
weeksworked10 37.12
weeksworked11 38.06
weeksworked12 38.15
weeksworked13 38.79
weeksworked14 38.73
weeksworked15 39.67
weeksworked16 40.19
weeksworked17 40.37
weeksworked18 40.01
weeksworked19 41.22
weeksworked20 38.35
weeksworked21 36.17
weeksworked22 11.43
We are able to additionally summarize Sequence values by one other Sequence, normally one that’s categorical:
nls97sdf.chat("Present satmath common by gender")Feminine Male
0 486.65 516.88
We are able to additionally create a brand new Sequence with the chat
methodology of SmartDataframe. We don’t want to make use of the precise column names. For instance, PandasAI will determine that we wish the childathome Sequence after we write baby at dwelling:
nls97sdf = nls97sdf.chat("Set childnum to baby at dwelling plus baby not at
dwelling")
nls97sdf[['childnum','childathome','childnotathome']].
pattern(5, random_state=1)childnum childathome childnotathome
personid
211230 2.00 2.00 0.00
990746 3.00 3.00 0.00
308169 3.00 1.00 2.00
798458 NaN NaN NaN
312009 NaN NaN NaN
We are able to use the chat
methodology to create Sequence values conditionally:
nls97sdf = nls97sdf.chat("evermarried is 'No' when maritalstatus is
'By no means-married', else 'Sure'")
nls97sdf.groupby(['evermarried','maritalstatus']).dimension()evermarried maritalstatus
No By no means-married 2767
Sure Divorced 669
Married 3068
Separated 148
Widowed 23
dtype: int64
PandasAI is kind of versatile relating to the language you may use right here. For instance, the next gives the identical outcomes:
nls97sdf = nls97sdf.chat("if maritalstatus is 'By no means-married' set
evermarried2 to 'No', in any other case 'Sure'")
nls97sdf.groupby(['evermarried2','maritalstatus']).dimension()evermarried2 maritalstatus
No By no means-married 2767
Sure Divorced 669
Married 3068
Separated 148
Widowed 23
dtype: int64
We are able to do calculations throughout various equally named columns:
nls97sdf = nls97sdf.chat("Set weeksworked for every row to the typical of
all weeksworked columns for that row")
This can calculate the typical of all weeksworked00-weeksworked22 columns and assign that to a brand new column referred to as weeksworked.
We are able to simply impute values the place they’re lacking primarily based on abstract statistics:
nls97sdf.gpaenglish.describe()rely 5,798
imply 273
std 74
min 0
25% 227
50% 284
75% 323
max 418
Title: gpaenglish, dtype: float64
nls97sdf = nls97sdf.chat("set lacking gpaenglish to the typical")
nls97sdf.gpaenglish.describe()
rely 8,984
imply 273
std 59
min 0
25% 264
50% 273
75% 298
max 418
Title: gpaenglish, dtype: float64
We are able to additionally use PandasAI to do some reshaping. Recall that the COVID-19 case information has new circumstances for every day for every nation. Let’s say we solely need the primary row of knowledge for every nation. We are able to do this the normal manner with drop_duplicates:
firstcase = covidcases.
sort_values(['location','casedate']).
drop_duplicates(['location'], hold='first')firstcase.set_index('location', inplace=True)
firstcase.form
(231, 67)
firstcase[['iso_code','continent','casedate',
'total_cases','new_cases']].head(2).T
location Afghanistan Albania
iso_code AFG ALB
continent Asia Europe
casedate 2020-03-01 2020-03-15
total_cases 1.00 33.00
new_cases 1.00 33.00
We are able to get the identical outcomes by making a SmartDataframe and utilizing the chat
methodology. The pure language I exploit right here is remarkably easy, Present first casedate and placement and different values for every nation:
covidcasessdf = SmartDataframe(covidcases, config={"llm": llm})
firstcasesdf = covidcasessdf.chat("Present first casedate and placement and
different values for every nation.")firstcasesdf.form
(231, 7)
firstcasesdf[['location','continent','casedate',
'total_cases','new_cases']].head(2).T
iso_code ABW AFG
location Aruba Afghanistan
continent North America Asia
casedate 2020-03-22 2020-03-01
total_cases 5.00 1.00
new_cases 5.00 1.00
Discover that PandasAI makes good selections concerning the columns to get. We get the columns we want somewhat than all of them. We may have additionally simply handed the names of the columns we needed to chat
. (PandasAI sorted the rows by iso_code, somewhat than by location, which is why the primary row is totally different.)
A lot of the work when utilizing PandasAI is admittedly simply importing the related libraries and instantiating massive language mannequin and SmartDataframe objects. As soon as that’s finished, easy sentences despatched to the chat
methodology of the SmartDataframe are adequate to summarize Sequence values and create new Sequence.
PandasAI excels at producing easy statistics from Sequence. We don’t even want to recollect the Sequence identify precisely. Usually the pure language we would use could be extra intuitive than conventional pandas strategies like groupby. The Present satmath common by gender worth handed to chat
is an effective instance of that.
Operations on Sequence, together with the creation of a brand new Sequence, can also be fairly easy. We created a complete variety of youngsters Sequence (childnum) by instructing the SmartDataframe so as to add the variety of youngsters residing at dwelling to the variety of youngsters not residing at dwelling. We didn’t even present the literal Sequence names, childathome and childnotathome respectively. PandasAI found out what we meant.
Since we’re passing pure language directions to chat
for our Sequence operations, there isn’t a one proper strategy to get what we wish. For instance, we get the identical outcome after we handed evermarried is ‘No’ when maritalstatus is ‘By no means-married’, else ‘Sure’
to chat
as we did with if maritalstatus is ‘By no means-married’ set evermarried2 to ‘No’, in any other case ‘Sure’
.
We are able to additionally do pretty in depth DataFrame reshaping with easy pure language directions, as within the final command we supplied. We add and different values
to the directions to get columns aside from casedate. PandasAI additionally figures out that location is smart because the index.
You may learn extra about how you can use PandasAI and SmartDataframes right here:
Or within the second version of my e book, Python Information Cleansing Cookbook:
Good luck together with your information cleansing and I might love to listen to how issues are going!