This post covers data management and exploratory analysis of the Mars craters dataset.
Selected dataset: marscrater_pds
This dataset describes the morphological features of the surface of Mars. For each crater it records the latitude, longitude, circular diameter, rim-to-floor depth, other morphological information, and the number of layers.
By admin on Aug, 29 2023
The following attributes were extracted and selected for analysis:
Categorical: MORPHOLOGY_EJECTA_3
Quantitative: DEPTH_RIMFLOOR_TOPOG
MORPHOLOGY_EJECTA_3 holds descriptors of the different ejecta morphology types.
DEPTH_RIMFLOOR_TOPOG is defined as the average elevation of the manually determined N points along the crater rim to the floor, in km.
Procedure:
The full dataset was loaded (384343 records) and then filtered to keep only the records that have a value in MORPHOLOGY_EJECTA_3. The resulting subset contains 1293 records.
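The filtering step can be sketched on a toy table. This is only an illustration, not the analysis itself: the rows and the label "Type X" are invented, and it assumes, as the script in this post does, that a missing morphology is stored as a blank string.

```python
import pandas as pd

# Toy stand-in for the crater table; every value here is invented for illustration.
toy = pd.DataFrame({
    "MORPHOLOGY_EJECTA_3": [" ", "Butterfly", " ", "Type X", "Butterfly"],
    "DEPTH_RIMFLOOR_TOPOG": [0.10, 0.45, 0.00, 0.30, 0.52],
})

# Keep only the rows whose morphology label is non-empty after trimming whitespace.
has_label = toy["MORPHOLOGY_EJECTA_3"].str.strip() != ""
subset = toy[has_label]

print(len(toy), "records loaded;", len(subset), "kept")
print(subset["MORPHOLOGY_EJECTA_3"].value_counts())
```

Trimming before comparing also catches labels made of several spaces, which a plain comparison against ' ' would keep.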
The question is whether the mean depth differs between the morphology types defined in MORPHOLOGY_EJECTA_3. To answer it, we applied a Tukey HSD test, which compares the mean depth of every pair of morphology types and reports which differences are statistically significant.
The original data contains 28 different morphology types (MORPHOLOGY_EJECTA_3). Because of the large number of pairwise combinations, only an extract of the results is shown below.
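To see why the number of combinations is large, the count of pairwise comparisons among 28 groups can be computed directly. The sketch assumes that all 28 types enter the Tukey comparison.

```python
from math import comb

n_types = 28                # number of morphology types reported in the text
n_pairs = comb(n_types, 2)  # unordered pairs: n * (n - 1) / 2

print(n_pairs)  # 378
```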
The results suggest that the mean depths of most morphology types do not differ significantly. The only significant differences are between Bumblebee and two other types: Butterfly and Inner is small Crown. For these pairs the null hypothesis is rejected; that is, there is statistical evidence that the group means differ.
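With many pairs, it helps to keep only the rows of the Tukey table where the null hypothesis is rejected. The following is a minimal sketch on synthetic data, not on the crater file: the group names A, B, C and the depth values are invented, and it assumes statsmodels' pairwise_tukeyhsd together with the same results-table layout that the script in this post exports to CSV.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic data: groups A and B share a mean, group C is clearly deeper.
rng = np.random.default_rng(0)
groups = np.repeat(["A", "B", "C"], 30)
depth = np.concatenate([
    rng.normal(0.2, 0.05, 30),
    rng.normal(0.2, 0.05, 30),
    rng.normal(0.8, 0.05, 30),
])

result = pairwise_tukeyhsd(endog=depth, groups=groups, alpha=0.05)

# The summary table stores the header in row 0 and one row per pair after it.
table = result.summary().data
pairs = pd.DataFrame(table[1:], columns=table[0])

# Keep only the pairs whose difference in means is significant.
significant = pairs[pairs["reject"].astype(bool)]
print(significant[["group1", "group2", "meandiff", "p-adj"]])
```

The same filter applied to the real results would list only the morphology pairs whose mean depths differ at the chosen alpha.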
Additionally, a box plot of the depth by morphology type is presented to give more information about the data.
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.stats.multicomp as multi
from scipy import stats
from statsmodels.formula.api import ols
fname = "marscrater_pds"
df = pd.read_csv(r"C:\dev\Course data analysis tools\Datasets\{}.csv".format(fname), low_memory=False)
df = df[df['MORPHOLOGY_EJECTA_3'] != ' ']
model = ols('DEPTH_RIMFLOOR_TOPOG ~ MORPHOLOGY_EJECTA_3', data=df).fit()  # Fit the ANOVA model
anov_table = sm.stats.anova_lm(model, typ=1)  # One-way ANOVA table
print('ANOVA results {}: '.format(fname))
print(anov_table)  # Show the ANOVA summary
print('\n')
# Assumption checks
dataResid = model.resid  # The normality test is run on the model residuals
normTest = stats.shapiro(dataResid)  # Shapiro-Wilk test
print('Normality test results:')
print(normTest)  # A p-value > 0.05 is taken to indicate normally distributed residuals
print('\n')
letLis = df['MORPHOLOGY_EJECTA_3'].unique()
letLis = letLis[letLis != ' ']
datalist = []
for x in letLis:  # Collect the depth values of each morphology type in its own list
    groupData = df.loc[df['MORPHOLOGY_EJECTA_3'] == x, 'DEPTH_RIMFLOOR_TOPOG'].tolist()
    datalist.append(groupData)
# Bartlett's test of equal variances across all groups that have at least two observations
bartData = stats.bartlett(*[g for g in datalist if len(g) > 1])
print('The results of the Bartlett test are: ')
print(bartData)  # A p-value > 0.05 is consistent with equal variances
print('\n')
# If the ANOVA is significant and the assumption checks hold, the next step is to see which groups differ.
dataCompare = multi.MultiComparison(df['DEPTH_RIMFLOOR_TOPOG'], df['MORPHOLOGY_EJECTA_3']) # Runs comparisons between treatments
tukeyResult = dataCompare.tukeyhsd() # Runs a tukey HSD test based on those comparisons
print(tukeyResult) # prints out a table displaying all relationships between groups
print('The groups compared (alpha = 0.05) for {} are:'.format(fname))
print(dataCompare.groupsunique)  # Prints the unique group labels used in the comparisons
df_anova = pd.DataFrame(data=tukeyResult._results_table.data[1:], columns=tukeyResult._results_table.data[0])
df_anova.to_csv('turkey_results_assigment1.csv')
mpl.rcParams['boxplot.medianprops.color'] = 'black'  # Make the median line black and thus visible
plt.figure()
ax = plt.gca()  # Get the current axes
ax.set_facecolor('xkcd:white')  # Use a white background
plt.boxplot(datalist, labels=letLis)  # One box per morphology type; data and labels are passed as lists
ax.tick_params(labelsize=8)
plt.xticks(rotation=90)  # Rotate the labels after plotting so the rotation is kept
## Everything below this is styling
plt.ylabel('Depth Rimfloor (km)', fontsize = 'large')
plt.show()