U-SQL 入門⑩ ~ 結合 ~
U-SQL では以下の結合が利用可能です。すべての JOIN において行セットの変数を指定することも SELECT 文を記述することもできます。
- CROSS JOIN
- INNER JOIN
- OUTER JOIN
- SEMIJOIN
- ANTISEMIJOIN
CROSS JOIN
行セットのデカルト積を返します。
@employees = SELECT * FROM (VALUES ("Rafferty", (int?) 31) , ("Jones", (int?) 33) , ("Heisenberg", (int?) 33) , ("Robinson", (int?) 34) , ("Smith", (int?) 34) , ("Williams", (int?) null)) AS E(EmpName, DepID); @departments = SELECT * FROM (VALUES ((int) 31, "Sales") , ((int) 33, "Engineering") , ((int) 34, "Clerical") , ((int) 35, "Marketing")) AS D(DepID, DepName); @rs_cross = SELECT e.EmpName, e.DepID, d.DepName FROM @employees AS e CROSS JOIN @departments AS d WHERE ((d.DepName == "Engineering") || (d.DepName == "Marketing")); OUTPUT @rs_cross TO "/Output/ReferenceGuide/Joins/CrossJoins/ExampleA.csv" USING Outputters.Csv(outputHeader : true);
出力結果
EmpName | DepID | DepName |
---|---|---|
Rafferty | 31 | Engineering |
Jones | 33 | Engineering |
Heisenberg | 33 | Engineering |
Robinson | 34 | Engineering |
Smith | 34 | Engineering |
Williams | Engineering | |
Rafferty | 31 | Marketing |
Jones | 33 | Marketing |
Heisenberg | 33 | Marketing |
Robinson | 34 | Marketing |
Smith | 34 | Marketing |
Williams | Marketing |
INNER JOIN
それぞれの行セットの指定したキーのカラムの値が一致するデータだけを返します。こちらの例では INNER JOIN にサブクエリを指定しています。
@employees = SELECT * FROM (VALUES ("Rafferty", (int?) 31) , ("Jones", (int?) 33) , ("Heisenberg", (int?) 33) , ("Robinson", (int?) 34) , ("Smith", (int?) 34) , ("Williams", (int?) null)) AS E(EmpName, DepID); @departments = SELECT * FROM (VALUES ((int) 31, "Sales") , ((int) 33, "Engineering") , ((int) 34, "Clerical") , ((int) 35, "Marketing")) AS D(DepID, DepName); @rs_inner = SELECT e.DepID AS EmpDepID, d.DepID, e.EmpName, d.DepName FROM @employees AS e INNER JOIN (SELECT (int?) DepID AS DepID, DepName FROM @departments) AS d ON e.DepID == d.DepID; OUTPUT @rs_inner TO "/output/rsInnerJoin.csv" USING Outputters.Csv(outputHeader : true);
出力結果
EmpDepID | DepID | EmpName | DepName |
---|---|---|---|
31 | 31 | Rafferty | Sales |
33 | 33 | Jones | Engineering |
33 | 33 | Heisenberg | Engineering |
34 | 34 | Robinson | Clerical |
34 | 34 | Smith | Clerical |
OUTER JOIN
指定したキーでそれぞれの行セットを結合します。キーが一致しない場合でも、指定した側のデータは維持されます。
LEFT OUTER JOIN の例
@employees = SELECT * FROM (VALUES ("Rafferty", (int?) 31) , ("Jones", (int?) 33) , ("Heisenberg", (int?) 33) , ("Robinson", (int?) 34) , ("Smith", (int?) 34) , ("Williams", (int?) null)) AS E(EmpName, DepID); @departments = SELECT * FROM (VALUES ((int?) 31, "Sales") , ((int?) 33, "Engineering") , ((int?) 34, "Clerical") , ((int?) 35, "Marketing")) AS D(DepID, DepName); @rs_leftouter = SELECT e.EmpName, d.DepName FROM @employees AS e LEFT OUTER JOIN @departments AS d ON e.DepID == d.DepID; OUTPUT @rs_leftouter TO "/output/rsLeftOuterJoin.csv" USING Outputters.Csv(outputHeader : true);
出力結果
EmpName | DepName |
---|---|
Rafferty | Sales |
Jones | Engineering |
Heisenberg | Engineering |
Robinson | Clerical |
Smith | Clerical |
Williams |
※RIGHT OUTER JOIN 、FULL OUTER JOIN も同様にサポートされます。
SEMIJOIN
SEMIJOIN は SELECT * FROM A WHERE A.Key IN (SELECT B.Key FROM B) のようなフィルターのような動作になります。LEFT SEMIJOIN と RIGHT SEMIJOIN の 2 つがあります。
LEFT SEMIJOIN の例
@employees = SELECT * FROM (VALUES ("Rafferty", (int?) 31) , ("Jones", (int?) 33) , ("Heisenberg", (int?) 33) , ("Robinson", (int?) 34) , ("Smith", (int?) 34) , ("Williams", (int?) null)) AS E(EmpName, DepID); @departments = SELECT * FROM (VALUES ((int?) 31, "Sales") , ((int?) 33, "Engineering") , ((int?) 34, "Clerical") , ((int?) 35, "Marketing")) AS D(DepID, DepName); @emps_in_valid_dept = SELECT e.EmpName, e.DepID FROM @employees AS e LEFT SEMIJOIN @departments AS d ON e.DepID == d.DepID; OUTPUT @emps_in_valid_dept TO "/output/rsLeftSemiJoinEmployeesInValidDept.csv" USING Outputters.Csv(outputHeader : true);
出力結果
EmpName | DepID |
---|---|
Rafferty | 31 |
Jones | 33 |
Heisenberg | 33 |
Robinson | 34 |
Smith | 34 |
ANTISEMIJOIN
ANTISEMIJOIN は SEMIJOIN の反対の動作をします。SELECT * FROM A WHERE A.key NOT IN (SELECT B.key FROM B) のように、マッチしない行を返します。
RIGHT ANTISEMIJOIN の例
@employees = SELECT * FROM (VALUES ("Rafferty", (int?) 31) , ("Jones", (int?) 33) , ("Heisenberg", (int?) 33) , ("Robinson", (int?) 34) , ("Smith", (int?) 34) , ("Williams", (int?) null)) AS E(EmpName, DepID); @departments = SELECT * FROM (VALUES ((int) 31, "Sales") , ((int) 33, "Engineering") , ((int) 34, "Clerical") , ((int) 35, "Marketing")) AS D(DepID, DepName); @depts_without_emps = SELECT d.DepName, d.DepID FROM @employees AS e RIGHT ANTISEMIJOIN (SELECT (int?) DepID AS DepID, DepName FROM @departments) AS d ON e.DepID == d.DepID; OUTPUT @depts_without_emps TO "/output/rsRightAntiSemiJoinDeptsWithoutEmployees.csv" USING Outputters.Csv(outputHeader : true);
出力結果
DepName | DepID |
---|---|
Marketing | 35 |
この記事はこちらの公式ドキュメントを参考にしたものです。
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/u-sql-select-selecting-from-joins
U-SQL 入門⑨ ~ 変数の使用 ~
DECLARE を使って変数を宣言することができます。これを使うことによって U-SQL のメンテナンス性が向上します。
また、Azure Data Factory を使って U-SQL を実行する際は、Azure Data Factory で定義したパラメータを U-SQL に渡すことができます。
DECLARE @in string = "/Samples/Data/SearchLog.tsv"; DECLARE @out string = "/output/SearchLog-scalar-variables.csv"; @searchlog = EXTRACT UserId int, Start DateTime, Region string, Query string, Duration int?, Urls string, ClickedUrls string FROM @in USING Extractors.Tsv(); OUTPUT @searchlog TO @out USING Outputters.Csv();
T-SQL と異なり、以下の様な記述はエラーになります。
// エラーになる DECLARE @var string; DECLARE @var string = ""; SET @var = "a value"
以下の様に行セットも DECLARE では使用できません。
DECLARE @maxval int = SELECT MAX(value) FROM data;
この記事はこちらの公式ドキュメントを参考にしたものです。
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/declare-variables-u-sql
Azure Data Factory から U-SQL パラメータを渡す場合
U-SQL スクリプトを使用したデータ変換 - Azure | Microsoft Docs
"parameters": { "in": "/Samples/Data/SearchLog.tsv", "out": "/output/SearchLog-scalar-variables.csv" }
U-SQL 入門⑧ ~ 行番号を振る ~
行に番号を振るには ROW_NUMBER ウィンドウ関数を使います。下記にいくつかの例を紹介します。
1. 普通に行番号を振る場合
@employees = SELECT * FROM ( VALUES ("Noah", "Engineering", 100, 10000), ("Sophia", "Engineering", 100, 15000), ("Liam", "Engineering", 100, 30000), ("Amy", "Engineering", 100, 35000), ("Justin", "Engineering", 100, 15000), ("Emma", "HR", 200, 8000), ("Jacob", "HR", 200, 8000), ("Olivia", "HR", 200, 8000), ("Mason", "Executive", 300, 50000), ("Ava", "Marketing", 400, 15000), ("Ethan", "Marketing", 400, 9000) ) AS T(EmpName, DeptName, DeptID, Salary); @result = SELECT ROW_NUMBER() OVER () AS RowNumber, EmpName, DeptName, Salary FROM @employees; OUTPUT @result TO "/Output/ReferenceGuide/Ranking/row_number/exampleA.csv" USING Outputters.Csv(outputHeader: true);
出力結果
RowNumber | EmpName | DeptName | Salary |
1 | Noah | Engineering | 10000 |
2 | Sophia | Engineering | 15000 |
3 | Liam | Engineering | 30000 |
4 | Amy | Engineering | 35000 |
5 | Justin | Engineering | 15000 |
6 | Emma | HR | 8000 |
7 | Jacob | HR | 8000 |
8 | Olivia | HR | 8000 |
9 | Mason | Executive | 50000 |
10 | Ava | Marketing | 15000 |
11 | Ethan | Marketing | 9000 |
2. Salary の高い順に振る場合
@employees = SELECT * FROM ( VALUES ("Noah", "Engineering", 100, 10000), ("Sophia", "Engineering", 100, 15000), ("Liam", "Engineering", 100, 30000), ("Amy", "Engineering", 100, 35000), ("Justin", "Engineering", 100, 15000), ("Emma", "HR", 200, 8000), ("Jacob", "HR", 200, 8000), ("Olivia", "HR", 200, 8000), ("Mason", "Executive", 300, 50000), ("Ava", "Marketing", 400, 15000), ("Ethan", "Marketing", 400, 9000) ) AS T(EmpName, DeptName, DeptID, Salary); // ORDER BY を使って Salary 列の降順で並べている @result = SELECT ROW_NUMBER() OVER (ORDER BY Salary DESC) AS RowNumber, EmpName, DeptName, Salary FROM @employees; OUTPUT @result TO "/Output/ReferenceGuide/Ranking/row_number/exampleB.csv" USING Outputters.Csv(outputHeader: true);
出力結果
RowNumber | EmpName | DeptName | Salary |
1 | Mason | Executive | 50000 |
2 | Amy | Engineering | 35000 |
3 | Liam | Engineering | 30000 |
4 | Sophia | Engineering | 15000 |
5 | Justin | Engineering | 15000 |
6 | Ava | Marketing | 15000 |
7 | Noah | Engineering | 10000 |
8 | Ethan | Marketing | 9000 |
9 | Emma | HR | 8000 |
10 | Jacob | HR | 8000 |
11 | Olivia | HR | 8000 |
3. 部署ごとに Salary の高い順に振る場合
@employees = SELECT * FROM ( VALUES ("Noah", "Engineering", 100, 10000), ("Sophia", "Engineering", 100, 15000), ("Liam", "Engineering", 100, 30000), ("Amy", "Engineering", 100, 35000), ("Justin", "Engineering", 100, 15000), ("Emma", "HR", 200, 8000), ("Jacob", "HR", 200, 8000), ("Olivia", "HR", 200, 8000), ("Mason", "Executive", 300, 50000), ("Ava", "Marketing", 400, 15000), ("Ethan", "Marketing", 400, 9000) ) AS T(EmpName, DeptName, DeptID, Salary); // PARTITION BY を使って DeptID ごとに分けている @result = SELECT ROW_NUMBER() OVER(PARTITION BY DeptID ORDER BY Salary DESC) AS RowNumberByDept, EmpName, DeptName, Salary FROM @employees; OUTPUT @result TO "/Output/ReferenceGuide/Ranking/row_number/exampleC.csv" USING Outputters.Csv(outputHeader: true);
RowNumberByDept | EmpName | DeptName | Salary |
1 | Amy | Engineering | 35000 |
2 | Liam | Engineering | 30000 |
3 | Sophia | Engineering | 15000 |
4 | Justin | Engineering | 15000 |
5 | Noah | Engineering | 10000 |
1 | Emma | HR | 8000 |
2 | Jacob | HR | 8000 |
3 | Olivia | HR | 8000 |
1 | Mason | Executive | 50000 |
1 | Ava | Marketing | 15000 |
2 | Ethan | Marketing | 9000 |
この記事はこちらの公式ドキュメントを参考にしたものです。
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/row-number-u-sql
U-SQL 入門⑦ ~ U-SQL 内で行セットを作成 ~
以下の様にして行セットを U-SQL 内で生成することができます。ちょっとテストする時などに便利。
@employees = SELECT * FROM ( VALUES (1, "Noah", "Engineering", 100, 10000), (2, "Sophia", "Engineering", 100, 20000), (3, "Liam", "Engineering", 100, 30000), (4, "Emma", "HR", 200, 10000), (5, "Jacob", "HR", 200, 10000), (6, "Olivia", "HR", 200, 10000), (7, "Mason", "Executive", 300, 50000), (8, "Ava", "Marketing", 400, 15000), (9, "Ethan", "Marketing", 400, 10000) ) AS T(EmpID, EmpName, DeptName, DeptID, Salary);
上記のように書くと @employees には以下のようなデータが格納される。
EmpID | EmpName | DeptName | DeptID | Salary |
---|---|---|---|---|
1 | Noah | Engineering | 100 | 10000 |
2 | Sophia | Engineering | 100 | 20000 |
3 | Liam | Engineering | 100 | 30000 |
4 | Emma | HR | 200 | 10000 |
5 | Jacob | HR | 200 | 10000 |
6 | Olivia | HR | 200 | 10000 |
7 | Mason | Executive | 300 | 50000 |
8 | Ava | Marketing | 400 | 15000 |
9 | Ethan | Marketing | 400 | 10000 |
U-SQL 入門⑥ ~ 集約関数 ~
U-SQL にはビルトインの集約関数が提供されています。これらの集約関数は SELECT 文の中で使用できます。
集約関数の一覧
ANY_VALUE | グループの中から任意の 1 つの値を取得する |
ARRAY_AGG | グループごとに SQL.ARRAY 値を生成する |
AVG | 平均 |
COUNT | 個数のカウント |
MAP_AGG | Key, Value 値を取り SQL.MAP 値を生成する |
MAX | 最大値 |
MIN | 最小値 |
SUM | 合計 |
STDEV | 標本標準偏差 |
STDEVP | 母標準偏差 |
VAR | 標本分散 |
VARP | 母分散 |
集約関数の実行例
@rs = SELECT Region, COUNT() AS NumSessions, SUM(Duration) AS TotalDuration, AVG(Duration) AS AvgDuration, MAX(Duration) AS MaxDuration, MIN(Duration) AS MinDuration, STDEV(Duration) AS StdevDuration, STDEVP(Duration) AS StdevPDuration, VAR(Duration) AS VarianceDuration, VARP(Duration) AS VariancePDuration FROM @searchlog GROUP BY Region;
その他集約関数
ANY_VALUE
ANY_VALUE 関数はグループの中から任意の 1 つの値を取得します。
以下ドキュメントにあるサンプルクエリを実行した結果です。
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/any-value-u-sql
クエリ
@employees = SELECT * FROM ( VALUES (1, "Noah", "Engineering", 100, 10000), (2, "Sophia", "Engineering", 100, 20000), (3, "Liam", "Engineering", 100, 30000), (4, "Emma", "HR", 200, 10000), (5, "Jacob", "HR", 200, 10000), (6, "Olivia", "HR", 200, 10000), (7, "Mason", "Executive", 300, 50000), (8, "Ava", "Marketing", 400, 15000), (9, "Ethan", "Marketing", 400, 10000) ) AS T(EmpID, EmpName, DeptName, DeptID, Salary); @result = SELECT DeptName, ANY_VALUE(EmpName) AS ArbitraryEmployee FROM @employees GROUP BY DeptName; OUTPUT @result TO "/Output/ReferenceGuide/any_value/exampleB.csv" USING Outputters.Csv(outputHeader: true);
実行結果
DeptName | ArbitraryEmployee |
---|---|
Engineering | Noah |
Executive | Mason |
HR | Emma |
Marketing | Ava |
ARRAY_AGG
ARRAY_AGG 関数はグループごとに SQL.ARRAY 値を生成します。
以下ドキュメントにあるサンプルクエリを実行した結果です。
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/array-agg-u-sql
クエリ
@employees = SELECT * FROM ( VALUES ("Noah", "cell:030-0074321"), ("Noah", "office:030-0076545"), ("Sophia", "cell:(5) 555-4729"), ("Sophia", "office:(5) 555-3745"), ("Liam", "cell:(5) 555-3932"), ("Amy", "cell:(171) 555-7788"), ("Amy", "office:(171) 555-6750"), ("Amy", "home:(425) 555-6238"), ("Justin", "cell:0921-12 34 65"), ("Justin", "office:0921-12 34 67"), ("Emma", (string)null), ("Jacob", ""), ("Olivia", "cell:88.60.15.31"), ("Olivia", "office:88.60.15.32"), ("Mason", "cell:(91) 555 22 82"), ("Mason", "office:(91) 555 91 99"), ("Mason", "home:(425) 555-2819"), ("Ava", "cell:91.24.45.40"), ("Ava", "office:91.24.45.41"), ("Ethan", "cell:(604) 555-4729"), ("Ethan", "office:(604) 555-3745"), ("David", "cell:(171) 555-1212"), ("Andrew", "cell:(1) 135-5555"), ("Andrew", "office:(1) 135-4892"), ("Jennie", "cell:(5) 555-3392"), ("Jennie", "office:(5) 555-7293") ) AS T(EmpName, PhoneNumber); @result = SELECT EmpName, string.Join(", ", ARRAY_AGG(PhoneNumber)) AS PhoneNumbers FROM @employees WHERE !string.IsNullOrEmpty(PhoneNumber) GROUP BY EmpName; OUTPUT @result TO "/Output/ReferenceGuide/Aggregate/array_agg/exampleA.csv" USING Outputters.Csv(outputHeader: true);
実行結果
EmpName | PhoneNumbers |
---|---|
Amy | cell:(171) 555-7788, office:(171) 555-6750, home:(425) 555-6238 |
Andrew | office:(1) 135-4892, cell:(1) 135-5555 |
Ava | cell:91.24.45.40, office:91.24.45.41 |
David | cell:(171) 555-1212 |
Ethan | cell:(604) 555-4729, office:(604) 555-3745 |
Jennie | office:(5) 555-7293, cell:(5) 555-3392 |
Justin | cell:0921-12 34 65, office:0921-12 34 67 |
Liam | cell:(5) 555-3932 |
Mason | cell:(91) 555 22 82, office:(91) 555 91 99, home:(425) 555-2819 |
Noah | cell:030-0074321, office:030-0076545 |
Olivia | office:88.60.15.32, cell:88.60.15.31 |
Sophia | office:(5) 555-3745, cell:(5) 555-4729 |
MAP_AGG
MAP_AGG 関数は Key, Value 値を取り SQL.MAP 値を生成します。
以下ドキュメントにあるサンプルクエリを実行した結果です。
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/map-agg-u-sql
クエリ
@employees = SELECT * FROM ( VALUES ("Noah", "cell", "030-0074321"), ("Noah", "office", "030-0076545"), ("Sophia", "cell", "(5) 555-4729"), ("Sophia", "office", "(5) 555-3745"), ("Liam", "cell", "(5) 555-3932"), ("Amy", "cell", "(171) 555-7788"), ("Amy", "office", "(171) 555-6750"), ("Amy", "home", "(425) 555-6238"), ("Justin", "cell", "0921-12 34 65"), ("Justin", "office", "0921-12 34 67"), ("Emma", (string)null, (string)null), ("Jacob", "", ""), ("Olivia", "cell", "88.60.15.31"), ("Olivia", "office", "88.60.15.32"), ("Mason", "cell", "(91) 555 22 82"), ("Mason", "office", "(91) 555 91 99"), ("Mason", "home", "(425) 555-2819"), ("Ava", "cell", "91.24.45.40"), ("Ava", "office", "91.24.45.41"), ("Ethan", "cell", "(604) 555-4729"), ("Ethan", "office", "(604) 555-3745"), ("David", "cell", "(171) 555-1212"), ("Andrew", "cell", "(1) 135-5555"), ("Andrew", "office", "(1) 135-4892"), ("Jennie", "cell", "(5) 555-3392"), ("Jennie", "office", "(5) 555-7293") ) AS T(EmpName, PhoneType, PhoneNumber); @result = SELECT EmpName, String.Join(",", MAP_AGG(PhoneType, PhoneNumber).Select(p => String.Format("{0}:{1}", p.Key, p.Value))).Trim() AS PhoneNumbers FROM @employees WHERE !string.IsNullOrWhiteSpace(PhoneType) GROUP BY EmpName; OUTPUT @result TO "/Output/ReferenceGuide/Aggregate/map_agg/exampleA.csv" USING Outputters.Csv(outputHeader: true);
実行結果
EmpName | PhoneNumbers |
---|---|
Amy | cell:(171) 555-7788,home:(425) 555-6238,office:(171) 555-6750 |
Andrew | cell:(1) 135-5555,office:(1) 135-4892 |
Ava | cell:91.24.45.40,office:91.24.45.41 |
David | cell:(171) 555-1212 |
Ethan | cell:(604) 555-4729,office:(604) 555-3745 |
Jennie | cell:(5) 555-3392,office:(5) 555-7293 |
Justin | cell:0921-12 34 65,office:0921-12 34 67 |
Liam | cell:(5) 555-3932 |
Mason | cell:(91) 555 22 82,home:(425) 555-2819,office:(91) 555 91 99 |
Noah | cell:030-0074321,office:030-0076545 |
Olivia | cell:88.60.15.31,office:88.60.15.32 |
Sophia | cell:(5) 555-4729,office:(5) 555-3745 |
この記事はこちらの公式ドキュメントを参考にしたものです。
https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/aggregate-functions-u-sql