Diploma


2017

Most Russian universities operate systems that collect and store data on their students’ academic progress. However, analysis tools that could exploit these large amounts of data, and thus help manage the educational process better, are rarely available. My diploma project was an attempt to build such tools for Bauman Moscow State Technical University (BMSTU).

The project focused on applying popular data analysis techniques and tools to the available data set, which was provided by the "EU" system maintained by BMSTU. A multidimensional OLAP model was built to give potential users a convenient way to explore the data. In addition, a set of Data Mining models was developed to perform deeper analysis and discover patterns in the data. Their main aim was to classify students as having a high or low risk of facing severe problems during the next exam session. Both kinds of models were created with Microsoft SQL Server Analysis Services and the corresponding SQL Server Data Tools IDE.

Developing the analysis tools required an easily accessible data source. A relational database hosted on Microsoft SQL Server served this purpose. It was populated by a .NET application, which retrieved data from the "EU" web services (SOAP- and REST-based), performed simple preprocessing and stored the results in the local database. A second application, written in Python, created the initial blank database (setting up tables and views) and, once the data had been loaded, prepared it for use as a data source for the analysis tools. This preparation involved generating additional static data (time dimensions and the like) and performing an initial aggregation of measures, which made developing the analysis tools much simpler. Python was chosen for this application, even though the data-retrieving one was written in C#, because of the language's simplicity and because the application mainly had to build and execute numerous SQL queries against the database. Bare SQL alone was not a viable option, since aggregating the measures easily required some of the freedom that traditional imperative languages offer.
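The time-dimension step described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: it uses the standard-library sqlite3 module in place of Microsoft SQL Server so that it runs on its own, and the table name dim_time, its columns, and the rule that assigns a term to each date are all assumptions made for the example.

```python
import sqlite3
from datetime import date, timedelta


def build_time_dimension(start, end):
    """Return one row per calendar day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        # Hypothetical rule: September through January counts as the autumn term.
        term = "autumn" if day.month >= 9 or day.month == 1 else "spring"
        rows.append((day.isoformat(), day.year, day.month, day.day, term))
        day += timedelta(days=1)
    return rows


def prepare_database(conn, start, end):
    """Create the (hypothetical) dim_time table and fill it with static data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dim_time ("
        "date_key TEXT PRIMARY KEY, year INTEGER, month INTEGER, "
        "day INTEGER, term TEXT)"
    )
    # Parameterized bulk insert; re-running the script overwrites existing rows.
    conn.executemany(
        "INSERT OR REPLACE INTO dim_time VALUES (?, ?, ?, ?, ?)",
        build_time_dimension(start, end),
    )
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    prepare_database(connection, date(2016, 9, 1), date(2017, 6, 30))
    for row in connection.execute(
        "SELECT term, COUNT(*) FROM dim_time GROUP BY term ORDER BY term"
    ):
        print(row)
```

Generating the rows in a loop and handing them to a parameterized bulk insert is the kind of task that is awkward in bare SQL but takes only a few lines in an imperative language.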

Although all the components were set up and functional, the developed data mining models could not be used to make reliable predictions. The key problem was very low recall when predicting problems with exams: most of the students who actually ran into problems were not flagged by the models. Several potential reasons for this failure can be identified, along with corresponding solutions. First of all, there are very few ‘problematic’ records in comparison with the number of entries showing good student performance. While this is certainly good for the university, it turned out that the Data Mining algorithms built into SQL Server Analysis Services do not perform well on skewed datasets. This suggests that trying other models (that is, either implementing custom algorithms or using third-party ones) may be a good idea. In addition, the "EU" system, which acts as the data source for the analysis models, has certain minor flaws and inconsistencies that further complicate the task of building reliable prediction tools. This issue might be resolved to some extent by more complex preprocessing logic.
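To show why low recall matters on a skewed dataset, here is a small self-contained sketch. The labels are invented purely for illustration and are not results from the project; the function names are likewise hypothetical. A model that flags almost nobody can still look accurate when at-risk students are rare.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary labelling."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p == positive:
            fp += 1
        elif a == positive and p != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def recall(actual, predicted, positive=1):
    """Share of truly positive records that the model managed to flag."""
    tp, _, fn, _ = confusion_counts(actual, predicted, positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(actual, predicted):
    """Share of all records that were labelled correctly."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


if __name__ == "__main__":
    # Made-up example: 95 students without problems (0), 5 at risk (1).
    actual = [0] * 95 + [1] * 5
    # A hypothetical model that flags only one of the five at-risk students.
    predicted = [0] * 95 + [1, 0, 0, 0, 0]
    print("accuracy:", accuracy(actual, predicted))
    print("recall:", recall(actual, predicted))
```

In this invented case the model is right about 96 of 100 students, yet it finds only one of the five who are at risk, which is why recall rather than accuracy is the measure to watch here.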

On the other hand, none of these problems makes the OLAP model less usable: it can still help an analyst draw conclusions from the existing data. Moreover, the Data Mining models did provide valuable insights into the existing data and made it possible to make certain judgments about students’ chances of passing exams without major problems, based on their in-term progress and other attributes. For these reasons the project can be considered successful. I hope that either I or other students will continue to work in this direction, improving the system and achieving the important goals that proved to be out of reach this time.

More info

Slides PDF