SQL Query Optimization Strategies in a Hadoop Ecosystem

by Sébastien Frackowiak
Université de Technologie de Compiègne - Master 2 2017



8.2 SQL on Hadoop

8.2.1 Managing partitioning manually in a query

This example is useful for understanding the roles of "DISTRIBUTE BY" and "SORT BY".

Let us create a table with two columns, "field1" and "field2", and load a few rows, the first being ('x1', 'y1'):

CREATE TABLE z_database1.test AS SELECT 'x1' AS field1, 'y1' AS field2;
INSERT INTO TABLE z_database1.test VALUES('x2','y2');
INSERT INTO TABLE z_database1.test VALUES('x3','y3');
INSERT INTO TABLE z_database1.test VALUES('x1','y3');
INSERT INTO TABLE z_database1.test VALUES('x0','y3');

Let us force the number of reducers to 3:

hive> set mapreduce.job.reduces=3;
hive> SELECT * FROM z_database1.test DISTRIBUTE BY field2 SORT BY field1;
OK
x2      y2
x0      y3
x1      y3
x3      y3
x1      y1

The output is indeed distributed by "field2" (all rows sharing a value go to the same reducer) and, within each reducer's output, sorted by "field1".
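The mechanism behind this result can be sketched outside Hive. The short Python simulation below is only an illustration, not Hive's implementation: the helper names (`java_style_hash`, `distribute_sort`) are hypothetical, and the 31-based string hash is an assumption chosen because it resembles the usual Java string hash; the hash Hive actually uses may differ between versions. The sketch routes each row to a reducer according to the hash of "field2", then lets each reducer sort its own rows by "field1".

```python
from collections import defaultdict

# Rows of z_database1.test as (field1, field2), in insertion order.
ROWS = [("x1", "y1"), ("x2", "y2"), ("x3", "y3"), ("x1", "y3"), ("x0", "y3")]


def java_style_hash(s: str) -> int:
    """31-based string hash (an assumption; Hive's real hash may differ)."""
    h = 0
    for byte in s.encode("utf-8"):
        h = (31 * h + byte) & 0xFFFFFFFF  # wrap like a 32-bit integer
    return h & 0x7FFFFFFF  # keep the value non-negative


def distribute_sort(rows, num_reducers):
    """Simulate DISTRIBUTE BY field2 SORT BY field1 on a list of rows."""
    buckets = defaultdict(list)
    for row in rows:
        # DISTRIBUTE BY: rows with equal field2 always reach the same reducer.
        buckets[java_style_hash(row[1]) % num_reducers].append(row)
    # SORT BY: each reducer sorts only its own rows, by field1.
    return [sorted(buckets[r], key=lambda row: row[0]) for r in range(num_reducers)]


if __name__ == "__main__":
    for reducer_id, part in enumerate(distribute_sort(ROWS, 3)):
        for field1, field2 in part:
            print(reducer_id, field1, field2)
```

Under these assumptions, the three simulated reducers hold (x2, y2), then (x0, y3), (x1, y3), (x3, y3), then (x1, y1), which is the order of the Hive listing above. The point to retain is that "SORT BY" orders rows only within each reducer: the concatenated output is not globally sorted by "field1".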

8.2.2 Understanding serialization in Hadoop

The following article explains the principle of serialization in Hadoop:

http://www.dummies.com/programming/big-data/hadoop/defining-table-record-formats-in-hive/

