Data Management and Visualization

This post covers data management for data exploration of the Mars craters dataset.

Selected dataset: marscrater_pds

This dataset provides information about the morphological features of the surface of Mars. For each crater it records latitude, longitude, circular diameter, rim-to-floor depth, other morphological information, and the number of layers.

By admin on Aug 29, 2023

The following attributes were extracted and selected for analysis:

Categorical: MORPHOLOGY_EJECTA_3
Quantitative: DEPTH_RIMFLOOR_TOPOG

MORPHOLOGY_EJECTA_3 holds the morphology type descriptors. DEPTH_RIMFLOOR_TOPOG is defined as the average elevation, from crater rim to floor, of the manually determined N points; the units are km.

Procedure: The full dataset was loaded (384343 records) and then filtered to keep only the records that have a value in MORPHOLOGY_EJECTA_3. The resulting subset contains 1293 records.

The question is whether the mean depth differs between the morphology types defined in MORPHOLOGY_EJECTA_3. To answer it, Tukey's test was applied to check for significant differences in mean depth between morphology types. The original data contains a total of 28 different morphology types (MORPHOLOGY_EJECTA). Because of the large number of possible pairwise combinations, only an extract of the results is reported.

The results suggest that the means of most morphology types do not differ significantly. A significant difference was found only between Bumblebee and two other types: Butterfly and Inner is Small Crown. For these pairs the null hypothesis is rejected; that is, there is statistical evidence that the means of these groups are different. A box plot is also included to give more information on the data.
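
To give a sense of why only an extract is practical: Tukey's test compares every pair of groups, so k groups produce k(k-1)/2 comparisons. The following is a minimal sketch, assuming that all 28 morphology types mentioned above take part in the comparison (the number of types present in the filtered subset may be smaller). The function name and the short label list are hypothetical and used only for illustration.

```python
from itertools import combinations
from math import comb


def count_pairwise(k):
    """Number of unordered pairs among k groups, i.e. k * (k - 1) / 2."""
    return comb(k, 2)


# Assumption: every one of the 28 morphology types enters the comparison
n_types = 28
print(count_pairwise(n_types))

# Enumerating the pairs explicitly for a small, made-up set of labels
labels = ["A", "B", "C", "D"]  # hypothetical labels, not taken from the dataset
pairs = list(combinations(labels, 2))
print(pairs)
```

With 28 groups this gives 378 pairwise comparisons, which is why the full table is unwieldy to print.
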
import pandas as pd
import statsmodels.api as sm
import statsmodels.stats.multicomp as multi
from statsmodels.formula.api import ols
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib as mpl

fname = "marscrater_pds"
df = pd.read_csv(r"C:\dev\Course data analysis tools\Datasets\{}.csv".format(fname), low_memory=False)

# Keep only the records that have a value in MORPHOLOGY_EJECTA_3
df = df[df['MORPHOLOGY_EJECTA_3'] != ' ']

model = ols('DEPTH_RIMFLOOR_TOPOG ~ MORPHOLOGY_EJECTA_3', data=df).fit()  # Fit the ANOVA model
anov_table = sm.stats.anova_lm(model, typ=1)  # One-way ANOVA
print('ANOVA results {}: '.format(fname))
print(anov_table)  # Show the ANOVA summary
print('\n')

# Assumption checks
dataResid = model.resid  # The normality test is run on the model residuals
normTest = stats.shapiro(dataResid)  # Shapiro-Wilk test
print('Normality test results:')
print(normTest)  # A p-value > 0.05 is taken as consistent with normally distributed data
print('\n')

letLis = df['MORPHOLOGY_EJECTA_3'].unique()
letLis = letLis[letLis != ' ']
datalist = []
for x in letLis:  # Bartlett test works better if provided a list of data
    groupData = df.loc[df['MORPHOLOGY_EJECTA_3'] == x, 'DEPTH_RIMFLOOR_TOPOG'].tolist()
    datalist.append(groupData)

# Test equality of variances across the morphology groups.
# Groups with a single record are skipped because their variance is undefined.
bartData = stats.bartlett(*[g for g in datalist if len(g) > 1])
print('The results of the bartlett test are: ')
print(bartData)  # A p-value > 0.05 is consistent with equal variances
print('\n')

# If the ANOVA is significant and the checks above hold, see which treatments differ
dataCompare = multi.MultiComparison(df['DEPTH_RIMFLOOR_TOPOG'], df['MORPHOLOGY_EJECTA_3'])  # Sets up the comparisons between treatments
tukeyResult = dataCompare.tukeyhsd()  # Runs a Tukey HSD test based on those comparisons
print(tukeyResult)  # Prints a table displaying all relationships between groups
print('Therefore, the unique treatments (alpha = 0.05) are (instance: {}):'.format(fname))
print(dataCompare.groupsunique)  # Prints the list of unique treatments

# Save the Tukey table: the first row holds the column names, the rest the comparisons
df_anova = pd.DataFrame(data=tukeyResult._results_table.data[1:], columns=tukeyResult._results_table.data[0])
df_anova.to_csv('turkey_results_assigment1.csv')

# Box plot of depth by morphology type
mpl.rcParams['boxplot.medianprops.color'] = 'black'  # Make the median line black and thus visible
plt.figure()
ax = plt.gca()  # Get the current axes
ax.set_facecolor('xkcd:white')  # Use a white background
plt.boxplot(datalist, labels=letLis)  # The data must be a list of lists, with one label per treatment
ax.tick_params(labelsize=8)
plt.xticks(rotation=90)  # Rotate the labels after plotting so the rotation is kept
plt.ylabel('Depth Rimfloor (km)', fontsize='large')
plt.show()
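
For readers who want to see what the one-way ANOVA step computes, here is a small sketch that derives the F statistic by hand as the ratio of the between-group to the within-group mean square. It is only an illustration under simplifying assumptions: the helper name `one_way_anova_f` and the toy numbers are made up, no p-value is computed, and it is not the routine statsmodels uses internally.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean, weighted by group size
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of each observation around its own group mean
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy depths (km) for three made-up groups; not taken from the Mars dataset
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print(f_stat, df_b, df_w)
```

For the toy data the between-group sum of squares is 26 and the within-group sum of squares is 6, so F = (26 / 2) / (6 / 6) = 13 with 2 and 6 degrees of freedom. A large F indicates that the group means vary more than the spread within groups would explain, which is the situation where a post-hoc test such as Tukey's becomes informative.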
